Abstract

We consider visible compression for discrete memoryless sources of mixed quantum states when only 
classical information can be sent from Alice to Bob. We assume that Bob knows the source statistics, and 
that Alice and Bob have identical random number generators. We put in an information theoretic framework 
some recent results on visible compression for sources of states with commuting density operators, and 
remove the commutativity requirement. We derive a general achievable compression rate, which is for the 
noncommutative case still higher than the known lower bound. We also present several related problems of 
classical information theory, and show how they can be used to answer some questions of the mixed state 
compression problem. 

Index Terms - quantum information theory, data compression, mixed-state sources. 
I. Introduction

A discrete memoryless source (DMS) of information produces a sequence of independent, identically distributed random variables taking values in a finite set called the source alphabet. In quantum systems, source letters are mapped into quantum states for quantum transmission or storage. In the simplest case, quantum states correspond to unit-length column vectors in a $d$-dimensional Hilbert space $\mathcal{H}_d$. Such quantum states are called pure. When $d = 2$, quantum states are called qubits. A column vector is denoted by $|\varphi\rangle$, its conjugate transpose by $\langle\varphi|$. A pure state is mathematically described by its density matrix, equal to the outer product $|\varphi\rangle\langle\varphi|$. In a more complex case, a quantum state can be any of a finite number of possible pure states $|\varphi_i\rangle$ with probability $p_i$. Such quantum states are called mixed. A mixed state is also described by its density matrix, which is equal to $\sum_i p_i |\varphi_i\rangle\langle\varphi_i|$. Note that a density matrix is a $d \times d$ Hermitian trace-one positive semidefinite matrix. A classical analog to a mixed state can be a multi-faced coin which turns up as any of its faces with the corresponding probability.
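The construction of a mixed-state density matrix can be sketched numerically. The snippet below is only an illustration: it assumes real-valued qubit vectors, and the two states and their probabilities (1/4 and 3/4) are made-up values, not taken from the paper.

```python
import math

def outer(v):
    """Outer product |v><v| of a real vector v, as a nested list."""
    return [[vi * vj for vj in v] for vi in v]

def mix(probs, vectors):
    """Density matrix sum_i p_i |v_i><v_i| for real unit vectors v_i."""
    d = len(vectors[0])
    rho = [[0.0] * d for _ in range(d)]
    for p, v in zip(probs, vectors):
        proj = outer(v)
        for r in range(d):
            for c in range(d):
                rho[r][c] += p * proj[r][c]
    return rho

# Two nonorthogonal qubit states (illustrative choice, not from the paper).
phi0 = [1.0, 0.0]
phi1 = [math.cos(math.pi / 3), math.sin(math.pi / 3)]
rho = mix([0.25, 0.75], [phi0, phi1])

# For a 2x2 matrix, trace one, symmetry, and a nonnegative determinant
# together imply that rho is a valid density matrix.
trace = rho[0][0] + rho[1][1]
det = rho[0][0] * rho[1][1] - rho[0][1] * rho[1][0]
```

The trace, symmetry, and determinant checks correspond to the "Hermitian trace-one positive semidefinite" properties for this small real example.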

Compression algorithms deal with source sequences rather than individual letters. There are two possible scenarios for which algorithms can be designed: visible, when the encoder Alice knows the source sequence, and blind, when only the quantum state corresponding to the sequence is available to her. The quantum state corresponding to a source sequence of length $n$ has a $d^n \times d^n$ density matrix, equal to the tensor product of the density matrices corresponding to the letters in the sequence. In the blind case, lossless quantum compression algorithms map (encode) these product states into states over Hilbert spaces of smaller dimension with arbitrarily high expected reconstruction (decoding) fidelity as $n \to \infty$. Operations used for encoding and decoding have to be allowed by quantum mechanics. In the visible case, Alice can as well compress the available classical information, which the decoder Bob can use to prepare a quantum state that (as in the blind case) approximates Alice's with arbitrarily high expected fidelity as $n \to \infty$.

The main question asks what the best compression compatible with the fidelity goal and the encoding/decoding constraints is for each scenario. The answer was given by Schumacher for discrete memoryless sources of pure quantum states. Lossless compression of sources of possibly mixed quantum states is not yet fully understood, and is the subject of current research ||2[-[0]. The optimal compression rate for the blind scenario was found by Koashi and Imoto in [0. A lower bound to the compression rate was established by Horodecki and by Barnum, Caves, Fuchs, Jozsa, and Schumacher in [Q]. The optimal compression rates for some special cases were found by Horodecki and by Barnum, Caves, Fuchs, Jozsa, and Schumacher in Q. More recently, an algorithm achieving the lower bound to the compression rate for the visible case of states with commuting density operators was presented by Dür, Vidal, and Cirac in [||, and a possibly related classical information theory problem was discussed by Kramer and Savari in [6]. Some of these results will be addressed in more detail after the problem we are dealing with is precisely formulated.

We are concerned with visible compression of discrete memoryless sources when only classical information can be sent from Alice to Bob. We assume that Bob knows the source statistics, and that Alice and Bob have identical random number generators. This scenario is the one studied by Dür, Vidal, and Cirac for the case of states with commuting density operators. When the problem is put in an information theoretic framework, the commutativity requirement can be easily removed, and an achievable rate can be found in the same manner. However, the derived achievable rate is still higher than the lower bound.

In the second part of the paper, we present several related problems of classical information theory, and 
show how they can be used to answer some questions of the mixed state compression problem. This paper is 
written for both information theorists and physicists, although papers written for two audiences often satisfy 
neither. Here writing for these two groups of scientists merely means that we tried to keep the paper as 
self contained as possible, and presented proofs and other material in an elementary rather than the most 
efficient way. 

A. Problem Formulation 

Let $\mathcal{X}$ be a finite set (alphabet), and $\{\rho_a \mid a \in \mathcal{X}\}$ a set of (possibly mixed) quantum states in a $d$-dimensional Hilbert space $\mathcal{H}_d$. Let $\mathcal{P}(\mathcal{X})$ be the set of all probability distributions on $\mathcal{X}$, and $P \in \mathcal{P}(\mathcal{X})$ a particular distribution. The set $\mathcal{E} = \{\rho_a, P(a) \mid a \in \mathcal{X}\}$ is usually referred to as an ensemble of mixed states indexed by the elements of $\mathcal{X}$. The density matrix of the ensemble $\mathcal{E}$, which we shall also refer to as the source density matrix, is given by

$$\rho = \sum_{a \in \mathcal{X}} P(a)\rho_a. \tag{1}$$

We shall assume that states $\rho_a$ are mixtures of known (possibly nonorthogonal) pure states as follows: Let $\mathcal{Y}$ be a finite set, and $\{|\psi_b\rangle\langle\psi_b| \mid b \in \mathcal{Y}\}$ be a set of pure quantum states in $\mathcal{H}_d$ indexed by the elements of $\mathcal{Y}$. Let $W$ be an $|\mathcal{X}| \times |\mathcal{Y}|$ stochastic matrix with elements $W_{ab} = W(b|a)$, $a \in \mathcal{X}$, $b \in \mathcal{Y}$, where $W(\cdot|a)$ is a probability distribution on $\mathcal{Y}$ for each $a \in \mathcal{X}$. We assume that no two states $\rho_a$ are identical in the sense that no two rows of $W$ are identical. The density matrices in $\mathcal{E}$ are given by

$$\rho_a = \sum_{b \in \mathcal{Y}} W(b|a)\,|\psi_b\rangle\langle\psi_b|, \quad a \in \mathcal{X}. \tag{2}$$

A source producing mixed states $\rho_a$, $a \in \mathcal{X}$, independently according to the probability distribution $P$, effectively produces pure states $|\psi_b\rangle\langle\psi_b|$, $b \in \mathcal{Y}$, independently according to the probability distribution $Q$:

$$Q(b) = \sum_{a \in \mathcal{X}} P(a)W(b|a).$$

Thus the source density matrix (1) can also be expressed in terms of $|\psi_b\rangle\langle\psi_b|$ and $Q(b)$, $b \in \mathcal{Y}$:

$$\rho = \sum_{b \in \mathcal{Y}} Q(b)\,|\psi_b\rangle\langle\psi_b|.$$
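The induced pure-state distribution $Q$ is a plain matrix-vector product and can be computed exactly with rationals. In this sketch the function name `letter_distribution` and the numerical values of `P` and `W` are assumptions made for illustration only.

```python
from fractions import Fraction as Fr

def letter_distribution(P, W):
    """Q(b) = sum_a P(a) W(b|a) for a prior P and a row-stochastic matrix W."""
    return [sum(P[a] * W[a][b] for a in range(len(P)))
            for b in range(len(W[0]))]

# Hypothetical two-letter source over three pure states (illustrative weights).
P = [Fr(1, 2), Fr(1, 2)]
W = [[Fr(2, 3), Fr(1, 3), Fr(0)],
     [Fr(0), Fr(1, 3), Fr(2, 3)]]

# Each row of W is a distribution on the pure-state index set, so Q sums to one.
Q = letter_distribution(P, W)
```

With these weights every pure state is produced with probability 1/3, even though neither source letter is itself a uniform mixture.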

Example 1: A possible mixed state ensemble is shown in Fig. 1. Here $d = 2$, $|\mathcal{X}| = 2$, and $|\mathcal{Y}| = 3$.

The memoryless source produces sequences of letters, where each letter is drawn from the set $\mathcal{X}$ independently according to the probability distribution $P$. Thus a source sequence $x = (x_1, \ldots, x_n) \in \mathcal{X}^n$ occurs with probability $P(x) = P(x_1) \cdots P(x_n)$, and the corresponding state has the density matrix $\rho_x = \rho_{x_1} \otimes \cdots \otimes \rho_{x_n}$. On the transmitting end, the encoder Alice knows $\mathcal{E}$ and $x$. On the receiving end, the decoder Bob knows $\mathcal{E}$. In addition, Alice and Bob have identical random number generators.
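[Figure 1: the pure states $|\psi_1\rangle$, $|\psi_2\rangle$, $|\psi_3\rangle$ shown as unit vectors, one of them with components $(-1/2, \sqrt{3}/2)$. The state $\rho_1$ is a mixture of $|\psi_1\rangle\langle\psi_1|$ and $|\psi_2\rangle\langle\psi_2|$, and the source density matrix $\rho$, a mixture of $\rho_1$ and $\rho_2$, is also a mixture of the three pure states.]

Fig. 1

A MIXED STATE ENSEMBLE.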



For each source sequence $x$, Alice prepares and sends to Bob $Rn$ bits of classical information, which he uses (together with his prior knowledge of $\mathcal{E}$) to prepare state $\tilde\rho_x$. To measure how faithfully mixed state $\sigma$ approximates mixed state $\omega$ and vice versa, we use the so-called mixed state fidelity $F$ defined as

$$F(\sigma, \omega) = \left\{\mathrm{Tr}\left[\left(\sqrt{\sigma}\,\omega\,\sqrt{\sigma}\right)^{1/2}\right]\right\}^2, \tag{3}$$

whose maximum value is 1. We shall say that the mixed state compression is lossless when the expected value of $F(\rho_x, \tilde\rho_x)$ can be made arbitrarily close to 1 by increasing the length $n$ of the source sequence:

$$\sum_{x \in \mathcal{X}^n} P(x)F(\rho_x, \tilde\rho_x) \to 1 \quad \text{as } n \to \infty. \tag{4}$$
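For qubits the fidelity (3) has a closed form that avoids matrix square roots. The sketch below relies on the identity $F = \mathrm{Tr}(\sigma\omega) + 2\sqrt{\det\sigma\,\det\omega}$, which holds only for $d = 2$; the function name and the test states are illustrative assumptions, and real symmetric matrices are assumed for simplicity.

```python
import math

def qubit_fidelity(sigma, omega):
    """Fidelity (3) for 2x2 real symmetric density matrices.

    Uses the qubit-only identity
        F = Tr(sigma omega) + 2 sqrt(det(sigma) det(omega)),
    which equals {Tr[(sqrt(sigma) omega sqrt(sigma))^(1/2)]}^2 when d = 2.
    """
    tr_prod = sum(sigma[i][j] * omega[j][i] for i in range(2) for j in range(2))
    det_s = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
    det_o = omega[0][0] * omega[1][1] - omega[0][1] * omega[1][0]
    # Clamp tiny negative round-off before taking the square root.
    return tr_prod + 2.0 * math.sqrt(max(det_s * det_o, 0.0))

maximally_mixed = [[0.5, 0.0], [0.0, 0.5]]
pure_zero = [[1.0, 0.0], [0.0, 0.0]]
pure_one = [[0.0, 0.0], [0.0, 1.0]]

f_same = qubit_fidelity(maximally_mixed, maximally_mixed)
f_orthogonal = qubit_fidelity(pure_zero, pure_one)
f_half = qubit_fidelity(pure_zero, maximally_mixed)
```

Identical states give fidelity 1 and orthogonal pure states give 0, matching the statement that the maximum value of $F$ is 1.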

B. Information Measures 

In compression of mixed-state sources by sending classical information, the well-known classical information measures will play a role. Entropy $H(Q)$, conditional entropy $H(W|P)$, and mutual information $I(P, W)$ are defined as

$$\begin{aligned}
H(Q) &= -\sum_{b \in \mathcal{Y}} Q(b)\log Q(b) \\
H(W|P) &= -\sum_{a \in \mathcal{X}} P(a) \sum_{b \in \mathcal{Y}} W(b|a)\log W(b|a) \\
I(P, W) &= H(Q) - H(W|P)
\end{aligned} \tag{5}$$
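The three quantities in (5) translate directly into code. The sketch below measures entropy in bits (an assumption; the definitions hold for any logarithm base), and the binary symmetric channel used as input is purely illustrative.

```python
import math

def entropy(dist):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def conditional_entropy(P, W):
    """H(W|P) = sum_a P(a) H(W(.|a))."""
    return sum(P[a] * entropy(W[a]) for a in range(len(P)))

def mutual_information(P, W):
    """I(P, W) = H(Q) - H(W|P), where Q(b) = sum_a P(a) W(b|a)."""
    Q = [sum(P[a] * W[a][b] for a in range(len(P)))
         for b in range(len(W[0]))]
    return entropy(Q) - conditional_entropy(P, W)

# Binary symmetric channel with crossover 0.1 and uniform input (illustrative).
P = [0.5, 0.5]
W = [[0.9, 0.1], [0.1, 0.9]]
rate = mutual_information(P, W)
```

For this input, $Q$ is uniform, so the result reduces to one bit minus the binary entropy of 0.1.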
The corresponding quantum information measures are the source Von Neumann entropy $S(\rho)$, the expected value $\bar S$ of the Von Neumann entropies of the source letters, and the Holevo quantity $\chi$:

$$\begin{aligned}
S(\rho) &= -\mathrm{Tr}\,\rho\log\rho \\
\bar S &= \sum_{a \in \mathcal{X}} P(a)S(\rho_a) \\
\chi &= S(\rho) - \bar S
\end{aligned} \tag{6}$$

When $|\psi_b\rangle\langle\psi_b|$, $b \in \mathcal{Y}$, are orthogonal, the quantum quantities (6) and their classical counterparts (5) are equal:

$$S(\rho) = H(Q), \qquad \bar S = H(W|P), \qquad \chi = I(P, W).$$

For the classical information theory problems discussed in Sec. VI, we also need the stochastic matrix $U$ with elements $U_{ba} = U(a|b)$, $a \in \mathcal{X}$, $b \in \mathcal{Y}$, where $U(\cdot|b)$ is a probability distribution on $\mathcal{X}$ for each $b \in \mathcal{Y}$. The elements of $U$ are computed as

$$U(a|b) = P(a)W(b|a)/Q(b).$$

Entropy $H(P)$, conditional entropy $H(U|Q)$, and mutual information $I(Q, U)$ are defined as the corresponding quantities in (5).

C. Known Results 

For sources of pure quantum states, the optimal compression rate is $S(\rho)$ for both the visible and the blind scenarios; the information sent from Alice to Bob is quantum Q. For sources of mixed quantum states and the fidelity criterion (4), the following has been shown: The Von Neumann entropy $S(\rho)$ is the optimal compression rate in the blind scenario [0]; the compression algorithm is the same as in the pure state case. A lower bound to the compression rate of any compression scheme is the Holevo quantity $\chi$ (Ml) [@|. This lower bound can be achieved by a specific compression algorithm in the case of quantum states with commuting density operators HI; the information sent from Alice to Bob is classical. Achievable compression rates for both visible and blind scenarios for sources of quantum states with commuting density operators and a fidelity criterion different than (4) have also been found (see Sec. VI-B).

When the density matrices $\rho_a$, $a \in \mathcal{X}$, commute, they can be made diagonal in the same basis. Thus, one can assume that they are mixtures of orthogonal pure states $|\psi_b\rangle\langle\psi_b|$, $b \in \mathcal{Y}$. We address the general case, i.e., the one where the $|\psi_b\rangle\langle\psi_b|$, $b \in \mathcal{Y}$, are not necessarily orthogonal.
D. The Idea for the Compression Algorithm

The main idea is simple to state for the reader already familiar with the notion of typicality as well as the notions of joint and conditional typicality. A rigorous description, given in the following sections, uses the precision provided by the method of types.

For each $x$, Alice's state $\rho_x$ is roughly a uniform mixture of pure states $|\Psi_y\rangle\langle\Psi_y| = |\psi_{y_1}\rangle\langle\psi_{y_1}| \otimes \cdots \otimes |\psi_{y_n}\rangle\langle\psi_{y_n}|$, where $y$ is conditionally $W$-typical with respect to $x$, and some unlikely pure states. For each $P$-typical $x$, there are about $\exp[nH(W|P)]$ such $y$'s, and they are $Q$-typical. There are about $\exp[nH(Q)]$ $Q$-typical $y$'s, and a randomly chosen $y$ will be conditionally $W$-typical with respect to any $P$-typical $x$ with probability of about $\exp[nH(W|P)]/\exp[nH(Q)] = \exp[-nI(P, W)]$. Therefore, if Bob forms a list of about $\exp[nI(P, W)]$ randomly chosen $Q$-typical $y$'s, then with high probability the list will contain a $y$ that is conditionally $W$-typical with respect to any $P$-typical $x$ Alice may have. If Alice and Bob use identical random number generators to form the list, Alice (who knows $x$) can identify such a $y$ to Bob by sending about $nI(P, W)$ bits of classical information. Bob can then prepare the corresponding $|\Psi_y\rangle\langle\Psi_y|$, or an error state if no $W$-typical $y$ was on the list. Therefore, for every $P$-typical $x$, Bob's state is with high probability also a uniform mixture of pure states $|\Psi_y\rangle\langle\Psi_y|$, where $y$ is conditionally $W$-typical with respect to $x$, and an unlikely error state.

The idea relies on Shannon's famous observation that "it is possible for most purposes to treat long sequences as though there were just $2^{Hn}$ of them, each with probability $2^{-Hn}$" |Q. The limitations of this "typical sequence" approach become apparent when one realizes how stringent a requirement the fidelity criterion (4) is. For probability distributions (diagonal density matrices), the fidelity is essentially equivalent to the $L_1$ distance (see, for example, [11, Ch. 9]). In the scheme sketched above, every sequence on Bob's list of randomly chosen $Q$-typical $y$'s appears with exactly the same probability. Bob's state is with high probability a uniform mixture of pure states $|\Psi_y\rangle\langle\Psi_y|$, where $y$ is conditionally $W$-typical with respect to $x$. Alice's state $\rho_x$ is also with high probability a mixture of the same pure states $|\Psi_y\rangle\langle\Psi_y|$, but not an exactly uniform one.

Thus for formal proofs, we use a simple refinement of the method of typical sequences, known as the method of types Q. Two sequences over some alphabet $\mathcal{A}$ have the same type if each letter in $\mathcal{A}$ appears in both of them the same number of times. All sequences of the same type form a type class. We partition the set of typical sequences into type classes. Sequences of the same type are equiprobable for a DMS, and Bob can form a list of sequences randomly chosen from the same type class. Now he will be dealing with a single type class at a time rather than the entire set of typical sequences. He has to know which type class to choose, but Alice can send that information to him at no asymptotic cost to the compression rate, since the number of type classes is polynomial in $n$. An additional benefit of using the method of types will be the speed of convergence of the fidelity to 1 as $n \to \infty$. When two or more sets of sequences are involved (as $\mathcal{X}^n$ and $\mathcal{Y}^n$ above), joint and conditional types have to be considered.
II. Fidelity of Mixed Quantum States

A. Fidelity and Trace Distance

Besides computing the mixed state fidelity (3), one can measure how close state $\sigma$ is to state $\omega$ by computing the trace distance

$$D(\sigma, \omega) = \frac{1}{2}\mathrm{Tr}|\sigma - \omega|.$$

Here $|A|$ denotes the positive square root of $A^\dagger A$, i.e., $|A| = \sqrt{A^\dagger A}$. The trace distance and the fidelity are closely related and the following holds:

$$1 - F(\sigma, \omega) \le D(\sigma, \omega) \le \sqrt{1 - F(\sigma, \omega)^2}. \tag{7}$$

The trace distance is a metric on the space of density operators, and therefore the triangle inequality is true:

$$D(\sigma, \omega) \le D(\sigma, \tau) + D(\tau, \omega). \tag{8}$$

It has some other useful properties, as well. When we need one of those properties, we shall switch from the fidelity to the trace distance and back by making use of the inequalities (7).
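The inequalities (7) can be checked numerically on small examples. This sketch is restricted to real symmetric qubit density matrices, where both quantities have closed forms; the helper names and the two test states are assumptions chosen for illustration.

```python
import math

def qubit_trace_distance(sigma, omega):
    """D = (1/2) Tr|sigma - omega| for 2x2 real symmetric density matrices.

    The difference is traceless, so its eigenvalues are +r and -r with
    r = sqrt(a^2 + b^2), a and b being its diagonal and off-diagonal entries.
    """
    a = sigma[0][0] - omega[0][0]
    b = sigma[0][1] - omega[0][1]
    return math.hypot(a, b)

def qubit_fidelity(sigma, omega):
    """Fidelity (3) via the qubit-only identity Tr(so) + 2 sqrt(det s det o)."""
    tr_prod = sum(sigma[i][j] * omega[j][i] for i in range(2) for j in range(2))
    det_s = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
    det_o = omega[0][0] * omega[1][1] - omega[0][1] * omega[1][0]
    return tr_prod + 2.0 * math.sqrt(max(det_s * det_o, 0.0))

sigma = [[0.8, 0.1], [0.1, 0.2]]   # illustrative mixed state
omega = [[0.5, 0.0], [0.0, 0.5]]   # maximally mixed state

F = qubit_fidelity(sigma, omega)
D = qubit_trace_distance(sigma, omega)
lower, upper = 1.0 - F, math.sqrt(1.0 - F * F)
```

For this pair the trace distance falls strictly between the two bounds, as (7) requires.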

Since we shall have to estimate the trace distance of a mixture of inputs, the following property, known as strong convexity, will be useful: Let $\{p_i\}$ and $\{q_i\}$ be probability distributions over some index set, and $\omega_i$ and $\sigma_i$ density operators also indexed by the same index set. Then

$$D\Big(\sum_i p_i\omega_i, \sum_i q_i\sigma_i\Big) \le D(\{p_i\}, \{q_i\}) + \sum_i p_i D(\omega_i, \sigma_i). \tag{9}$$

From strong convexity, it directly follows that the trace distance is jointly convex in its arguments:

$$D\Big(\sum_i p_i\omega_i, \sum_i p_i\sigma_i\Big) \le \sum_i p_i D(\omega_i, \sigma_i). \tag{10}$$

All the above properties of the mixed state fidelity and the trace distance, and some additional ones, are discussed in the excellent survey [11, Ch. 9].



B. Approximating Density Matrices 



The objective of the compression algorithm described in Sec. IV-A is to leave Bob with states that faithfully approximate Alice's. Only two types of approximations will be used, which we can already demonstrate by just using the above properties of the fidelity and the trace distance.

Let $\sigma$ and $\sigma_e$ be two density matrices, $p_{e,n}$ a sequence of numbers such that $p_{e,n} \to 0$ as $n \to \infty$, and $\omega_n$ defined as follows:

$$\omega_n = p_{e,n}\sigma_e + (1 - p_{e,n})\sigma.$$

Lemma 1: Let $\sigma$ and $\omega_n$ be as defined above. Then $F(\sigma, \omega_n) \to 1$ as $n \to \infty$.

Proof: Write $\sigma = 0 \cdot \sigma_e + 1 \cdot \sigma$. By properties (7) and strong convexity of the trace distance (9), we have

$$\begin{aligned}
F(\sigma, \omega_n) &\ge 1 - D(\sigma, \omega_n) \\
&\ge 1 - \tfrac{1}{2}|0 - p_{e,n}| - \tfrac{1}{2}|1 - (1 - p_{e,n})| - D(\sigma, \sigma) \\
&\ge 1 - p_{e,n}.
\end{aligned}$$

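Let $\mathcal{Y}$ be a finite set and $\pi_n \in \mathcal{P}(\mathcal{Y}^n)$ a probability distribution on $\mathcal{Y}^n$. Let $\{\sigma_y, \pi_n(y) \mid y \in \mathcal{Y}^n\}$ be an ensemble of (possibly mixed) states over the Hilbert space $\mathcal{H}_d^{\otimes n}$. Consider the following density matrix:

$$\sigma_n = \sum_{y \in \mathcal{Y}^n} \pi_n(y)\sigma_y.$$

Let $B_n \subseteq \mathcal{Y}^n$ be a probabilistically large set: $\pi_n(B_n) = 1 - \epsilon_n$, where $\epsilon_n \to 0$ as $n \to \infty$. It is intuitively clear that if we replace states $\sigma_y$, $y \in \mathcal{Y}^n \setminus B_n$, in the expression for $\sigma_n$ by a fixed state $\sigma_e$, we obtain a density matrix which faithfully represents $\sigma_n$ in the sense of (4) when $n \to \infty$. To prove a slightly stronger result (which we shall use in Sec. IV-C), we proceed as follows.

Consider

$$\sigma_n = \sum_{y \in B_n} \pi_n(y)\sigma_y + \sum_{y \in \mathcal{Y}^n \setminus B_n} \pi_n(y)\sigma_y.$$

Let $p_{e,n}$ be a sequence of numbers such that $p_{e,n} \to 0$ as $n \to \infty$, and $\tilde\sigma_y$, $y \in \mathcal{Y}^n$, a set of density matrices such that $D(\sigma_y, \tilde\sigma_y) \le p_{e,n}$ for all $y$. We define a density matrix $\omega_n$ as

$$\omega_n = \sum_{y \in B_n} \pi_n(y)\tilde\sigma_y + \sum_{y \in \mathcal{Y}^n \setminus B_n} \pi_n(y)\sigma_e.$$

Lemma 2: Let $\sigma_n$ and $\omega_n$ be as defined above. Then $F(\sigma_n, \omega_n) \to 1$ as $n \to \infty$.

Proof: By properties (7) and joint convexity of the trace distance (10), we have

$$\begin{aligned}
F(\sigma_n, \omega_n) &\ge 1 - D(\sigma_n, \omega_n) \\
&\ge 1 - \sum_{y \in B_n} \pi_n(y)D(\sigma_y, \tilde\sigma_y) - \sum_{y \in \mathcal{Y}^n \setminus B_n} \pi_n(y)D(\sigma_y, \sigma_e) \\
&\ge 1 - (1 - \epsilon_n)p_{e,n} - \epsilon_n \ge 1 - p_{e,n} - \epsilon_n,
\end{aligned}$$

where the last line uses $D(\sigma_y, \sigma_e) \le 1$.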



III. The Method of Types 

A. Types and Typical Sequences

Let, as before, $\mathcal{X}$ be a finite set and $\mathcal{P}(\mathcal{X})$ the set of all probability distributions on $\mathcal{X}$. Given a sequence $x = (x_1, \ldots, x_n) \in \mathcal{X}^n$ and a letter $a \in \mathcal{X}$, let $N(a|x)$ denote the number of occurrences of $a$ in $x$.
Definition 1: The type of a sequence $x \in \mathcal{X}^n$ is the distribution $P_x \in \mathcal{P}(\mathcal{X})$ given by

$$P_x(a) = \frac{1}{n}N(a|x) \quad \text{for every } a \in \mathcal{X}.$$

Conversely, the type class of a distribution $P \in \mathcal{P}(\mathcal{X})$ is the set $T_P$ of all sequences of type $P$ in $\mathcal{X}^n$:

$$T_P = \{x : x \in \mathcal{X}^n \text{ and } P_x = P\}.$$

The subset of $\mathcal{P}(\mathcal{X})$ consisting of the possible types of sequences $x \in \mathcal{X}^n$ is denoted by $\mathcal{P}_n(\mathcal{X})$. It is easy to show by elementary combinatorics that

$$|\mathcal{P}_n(\mathcal{X})| = \binom{n + |\mathcal{X}| - 1}{|\mathcal{X}| - 1} \le (n + 1)^{|\mathcal{X}|}.$$

Therefore, there is only a polynomial (in $n$) number of types.
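The count of types can be confirmed by brute force for a small alphabet. The following sketch enumerates all sequences of length 5 over a three-letter alphabet (both chosen arbitrarily for illustration) and represents a type by its tuple of counts $N(a|x)$.

```python
from collections import Counter
from itertools import product
from math import comb

def type_of(x, alphabet):
    """Type of sequence x, represented by the tuple of counts N(a|x)."""
    counts = Counter(x)
    return tuple(counts[a] for a in alphabet)

alphabet = "abc"
n = 5

# Enumerate all |X|^n sequences and collect the distinct types.
types = {type_of(x, alphabet) for x in product(alphabet, repeat=n)}
num_types = len(types)

# Closed-form count and the polynomial upper bound from the text.
closed_form = comb(n + len(alphabet) - 1, len(alphabet) - 1)
poly_bound = (n + 1) ** len(alphabet)
```

Here 243 sequences fall into only 21 types, well below the bound $(n+1)^{|\mathcal{X}|} = 216$.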
The size of $T_P$ can be bounded as follows:

Lemma 3: [10, pp. 30] For any type $P_x$ of sequences in $\mathcal{X}^n$,

$$(n + 1)^{-|\mathcal{X}|}\exp\{nH(P_x)\} \le |T_{P_x}| \le \exp\{nH(P_x)\}.$$
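The bounds of Lemma 3 can be checked against the exact size of a type class, which is a multinomial coefficient. In this sketch entropy is in nats so that it pairs with `exp`; the counts (6, 3, 1) are an arbitrary illustrative type with $n = 10$.

```python
from math import exp, factorial, log

def type_class_size(counts):
    """Number of sequences with the given letter counts (multinomial coefficient)."""
    size = factorial(sum(counts))
    for c in counts:
        size //= factorial(c)
    return size

def entropy_nats(counts):
    """Entropy (natural log) of the type with the given letter counts."""
    n = sum(counts)
    return -sum(c / n * log(c / n) for c in counts if c > 0)

counts = (6, 3, 1)            # N(a|x) over a three-letter alphabet, n = 10
n = sum(counts)

size = type_class_size(counts)
upper = exp(n * entropy_nats(counts))
lower = (n + 1) ** (-len(counts)) * upper
```

The exact size, 840, sits between the polynomially discounted lower bound and the exponential upper bound.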



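Definition 2: For any distribution $P$ on $\mathcal{X}$, a sequence $x \in \mathcal{X}^n$ is $P$-typical with constant $\delta$ if

$$\left|\frac{1}{n}N(a|x) - P(a)\right| \le \delta \quad \text{for every } a \in \mathcal{X},$$

and no $a \in \mathcal{X}$ with $P(a) = 0$ occurs in $x$. The set of such sequences will be denoted by $T^n_{P,\delta}$, and the set of their types by $\mathcal{P}_n^{P,\delta}(\mathcal{X})$.

Lemma 4: [10, p. 34] For any distribution $P$ on $\mathcal{X}$, we have

$$\lim_{n \to \infty} P^n(T^n_{P,\delta}) = 1.$$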

B. Joint and Conditional Types 

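If $\mathcal{X}$ and $\mathcal{Y}$ are two finite sets, the joint type of a pair of sequences $x \in \mathcal{X}^n$ and $y \in \mathcal{Y}^n$ is defined as the type of the sequence $\{(x_1, y_1), \ldots, (x_n, y_n)\} \in (\mathcal{X} \times \mathcal{Y})^n$. Namely, it is the distribution $P_{x,y} \in \mathcal{P}(\mathcal{X} \times \mathcal{Y})$ given by

$$P_{x,y}(a, b) = \frac{1}{n}N(a, b|x, y) \quad \text{for every } a \in \mathcal{X},\ b \in \mathcal{Y}.$$

Joint types are often given in terms of the type of $x$ and a stochastic matrix $V : \mathcal{X} \to \mathcal{Y}$ as

$$P_{x,y}(a, b) = P_x(a)V(b|a) \quad \text{for every } a \in \mathcal{X},\ b \in \mathcal{Y}.$$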
Definition 3: We say that $y \in \mathcal{Y}^n$ has conditional type $V$ given $x \in \mathcal{X}^n$ if

$$N(a, b|x, y) = N(a|x)V(b|a) \quad \text{for every } a \in \mathcal{X},\ b \in \mathcal{Y}.$$

For any given $x \in \mathcal{X}^n$ and a stochastic matrix $V : \mathcal{X} \to \mathcal{Y}$, the set of sequences $y \in \mathcal{Y}^n$ having conditional type $V$ given $x$ is called the $V$-shell of $x$, and is denoted by $T^n_V(x)$ or simply by $T_V(x)$. The set of all conditional types of $y \in \mathcal{Y}^n$ for a given $x$ will be denoted by $\mathcal{V}_n(\mathcal{Y}, x)$.

The size of a $V$-shell can be bounded as follows:

Lemma 5: [10, pp. 31] For any type $P_x$ of sequences in $\mathcal{X}^n$ and stochastic matrix $V$ such that $T_V(x)$ is not empty:

$$(n + 1)^{-|\mathcal{X}||\mathcal{Y}|}\exp\{nH(V|P_x)\} \le |T_V(x)| \le \exp\{nH(V|P_x)\}.$$
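A $V$-shell can be enumerated directly for a short sequence. The sketch below specifies the conditional type through the joint counts $N(a,b|x,y) = N(a|x)V(b|a)$; the sequence `x`, the binary face alphabet, and the particular $V$ are illustrative assumptions.

```python
from collections import Counter
from itertools import product

def joint_counts(x, y):
    """N(a,b|x,y): number of positions i with (x_i, y_i) = (a, b)."""
    return Counter(zip(x, y))

def v_shell(x, y_alphabet, target):
    """All y whose joint counts with x equal the counts in `target`."""
    return [y for y in product(y_alphabet, repeat=len(x))
            if joint_counts(x, y) == target]

x = "aabb"
# Conditional type V with V(0|a) = V(1|a) = 1/2 and V(0|b) = 1,
# expressed through the counts N(a|x) V(b|a).
target = Counter({("a", 0): 1, ("a", 1): 1, ("b", 0): 2})
shell = v_shell(x, (0, 1), target)
```

The shell contains two sequences, consistent with the upper bound $\exp\{nH(V|P_x)\} = 4$ of Lemma 5 for this choice.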



Clearly, every $y$ in the $V$-shell of an $x$ in the type class $T_{P_x}$ has the same type $P_y$:

$$P_y(b) = \sum_{a \in \mathcal{X}} P_x(a)V(b|a).$$

However, by Lemmas 3 and 5, we immediately see that $T_V(x)$ is "exponentially smaller" than $T_{P_y}$, unless all rows of $V$ are equal to $P_y$:

$$(n + 1)^{-|\mathcal{X}||\mathcal{Y}|}\exp\{-nI(P_x, V)\} \le \frac{|T_V(x)|}{|T_{P_y}|} \le (n + 1)^{|\mathcal{Y}|}\exp\{-nI(P_x, V)\}. \tag{12}$$



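Definition 4: For any given $x \in \mathcal{X}^n$ and a stochastic matrix $W : \mathcal{X} \to \mathcal{Y}$, a sequence $y \in \mathcal{Y}^n$ is $W$-generated by $x$ (or $W$-typical under the condition $x$) with constant $\delta'$ if

$$\left|\frac{1}{n}N(a, b|x, y) - \frac{1}{n}N(a|x)W(b|a)\right| \le \delta' \quad \text{for every } a \in \mathcal{X},\ b \in \mathcal{Y},$$

and $N(a, b|x, y) = 0$ whenever $W(b|a) = 0$. The set of such sequences will be denoted by $T^n_{W,\delta'}(x)$, and the set of their conditional types by $\mathcal{V}_n^{W,\delta'}(\mathcal{Y}, x)$.

Lemma 6: [10, p. 34] For any stochastic matrix $W : \mathcal{X} \to \mathcal{Y}$, we have

$$\lim_{n \to \infty} W^n(T^n_{W,\delta'}(x) \mid x) = 1.$$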



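C. Conditional Typical States

Let $\rho_a$ be the density matrix of mixed state $a$ given by (2). We consider $\rho_x = \rho_{x_1} \otimes \cdots \otimes \rho_{x_n}$ for $x \in T^n_{P,\delta}$:

$$\begin{aligned}
\rho_x &= \Big(\sum_{b \in \mathcal{Y}} W(b|x_1)|\psi_b\rangle\langle\psi_b|\Big) \otimes \cdots \otimes \Big(\sum_{b \in \mathcal{Y}} W(b|x_n)|\psi_b\rangle\langle\psi_b|\Big) \\
&= \sum_{y \in \mathcal{Y}^n} W(y_1|x_1) \cdots W(y_n|x_n)\,|\psi_{y_1}\rangle\langle\psi_{y_1}| \otimes \cdots \otimes |\psi_{y_n}\rangle\langle\psi_{y_n}| \\
&= \sum_{y \in \mathcal{Y}^n} W^n(y|x)\,|\Psi_y\rangle\langle\Psi_y|,
\end{aligned}$$

where $W^n(y|x)$ denotes $W(y_1|x_1) \cdots W(y_n|x_n)$ and $|\Psi_y\rangle\langle\Psi_y|$ denotes $|\psi_{y_1}\rangle\langle\psi_{y_1}| \otimes \cdots \otimes |\psi_{y_n}\rangle\langle\psi_{y_n}|$. Since $W^n(y|x)$ takes the same value for all $y$ in a given $V$-shell, we define partial density matrices $\rho_x(V)$ corresponding to each $V$-shell in $\mathcal{V}_n(\mathcal{Y}, x)$ as follows:

$$\rho_x(V) = \frac{1}{|T_V(x)|}\sum_{y \in T_V(x)} |\Psi_y\rangle\langle\Psi_y|.$$

Now we can write $\rho_x$ as

$$\rho_x = \sum_{V \in \mathcal{V}_n^{W,\delta'}(\mathcal{Y},x)} W^n(T_V(x)|x)\rho_x(V) + \sum_{V \in \mathcal{V}_n(\mathcal{Y},x) \setminus \mathcal{V}_n^{W,\delta'}(\mathcal{Y},x)} W^n(T_V(x)|x)\rho_x(V), \tag{13}$$

where the first term includes only the conditionally typical $V$-shells ($W$-generated by $x$), and the second term takes care of the rest.

We are now ready to describe a mixed state compression algorithm. We shall see that for every typical $x$, the algorithm leaves Bob with a mixed state $\tilde\rho_x$ which differs from Alice's $\rho_x$ of (13) only in the following: In the first term of (13), $\rho_x(V)$ is approximated by $\tilde\rho_x(V)$ in the sense of Lemma 1; in the second term of (13), $\rho_x(V)$ is simply replaced by some fixed error-state $\rho_{e,x}$. Consequently, $\rho_x$ is approximated by $\tilde\rho_x$ in the sense of Lemma 2.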

IV. Mixed State Compression 

A. The Algorithm 

Alice and Bob have identical random number generators.

1. Alice is given a visible source sequence $x \in \mathcal{X}^n$.

2. For every $a \in \mathcal{X}$, Alice determines $N(a|x)$, i.e., the type $P_x$.

3. If $P_x$ is not in $\mathcal{P}_n^{P,\delta}(\mathcal{X})$, i.e., $x$ is not $P$-typical with constant $\delta$, Alice sends an error indicator, and Bob prepares some fixed error-state $\rho_e$. Otherwise, they proceed as follows:

4. Alice chooses a conditional type for sequence $y$, say $V$, at random with probability $W^n(T_V(x)|x)$. If $V$ is not in $\mathcal{V}_n^{W,\delta'}(\mathcal{Y}, x)$, i.e., $y$ is not $W$-generated by $x$ with constant $\delta'$, Alice sends an error indicator, and Bob prepares some fixed error-state $\rho_{e,x}$. (Here $\rho_{e,x}$ and $\rho_{e,x}(V)$ below do not depend on $x$ and $V$ since Bob does not have that information. The notation signifies the stage in the algorithm.) Otherwise, they proceed as follows:

5. Alice determines the type $P_y$ by computing

$$P_y(b) = \sum_{a \in \mathcal{X}} P_x(a)V(b|a).$$

6. Alice tells the type $P_y$ to Bob by sending $\log|\mathcal{P}_n(\mathcal{Y})|$ bits identifying the particular $P_y$.

7. Alice and Bob each form a list of $N_l$ sequences $y$ by drawing randomly from the type class $T_{P_y}$. Let

$$R = \frac{1}{n}\log N_l.$$

8. If there is one or more $y$'s on the list belonging to the $V$-shell $T_V(x)$, Alice sends $\log N_l$ bits to Bob identifying the position of the first $y \in T_V(x)$ on the list, and Bob prepares $|\Psi_y\rangle\langle\Psi_y|$. With some probability $p_{e,x}(V)$, no $y \in T_V(x)$ will be on the list that Alice and Bob form. If that is the case, Alice sends an error indicator and Bob prepares some fixed error-state $\rho_{e,x}(V)$.
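Steps 7 and 8 can be sketched as a toy simulation in which the shared random number generator is modeled by a common seed. Everything here is a hypothetical illustration: the function names, the seed, the list size, and the small example sequence are assumptions, and Bob's output is the classical label $y$ rather than the quantum state $|\Psi_y\rangle\langle\Psi_y|$.

```python
import random
from collections import Counter

def shared_list(y_counts, list_size, seed):
    """List of sequences drawn uniformly from the type class with counts y_counts.

    Both parties call this with the same seed, modeling identical generators.
    """
    rng = random.Random(seed)
    base = [b for b, c in sorted(y_counts.items()) for _ in range(c)]
    out = []
    for _ in range(list_size):
        seq = base[:]
        rng.shuffle(seq)          # uniform over arrangements of the multiset
        out.append(tuple(seq))
    return out

def alice_encode(x, shell_counts, y_list):
    """Position of the first y on the list in the V-shell of x, else None."""
    for index, y in enumerate(y_list):
        if Counter(zip(x, y)) == shell_counts:
            return index
    return None                   # corresponds to the error indicator

def bob_decode(index, y_list):
    """Sequence y that Bob would prepare, or None for the error state."""
    return None if index is None else y_list[index]

x = "aabb"
shell_counts = Counter({("a", 0): 1, ("a", 1): 1, ("b", 0): 2})
y_counts = {0: 3, 1: 1}           # type P_y shared by every y in the shell

y_list = shared_list(y_counts, list_size=64, seed=7)
index = alice_encode(x, shell_counts, y_list)
decoded = bob_decode(index, shared_list(y_counts, list_size=64, seed=7))
```

Only the index crosses the channel; Bob regenerates the same list locally, which is why the message costs about $\log N_l$ bits.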

B. Bob's Density Matrix 
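For non-typical $x$, Bob's state is $\rho_e$, while for typical $x$, his state is given by

$$\tilde\rho_x = \sum_{V \in \mathcal{V}_n^{W,\delta'}(\mathcal{Y},x)} W^n(T_V(x)|x)\tilde\rho_x(V) + \sum_{V \in \mathcal{V}_n(\mathcal{Y},x) \setminus \mathcal{V}_n^{W,\delta'}(\mathcal{Y},x)} W^n(T_V(x)|x)\rho_{e,x}, \quad x \in T^n_{P,\delta}.$$

Here $\tilde\rho_x(V)$ denotes Bob's density matrix when conditional type $V$ is chosen by Alice. Since Bob prepares either the error-state $\rho_{e,x}(V)$ with probability $p_{e,x}(V)$, or one of the states $|\Psi_y\rangle\langle\Psi_y|$, $y \in T_V(x)$, with probability $1 - p_{e,x}(V)$, we have

$$\tilde\rho_x(V) = p_{e,x}(V)\rho_{e,x}(V) + (1 - p_{e,x}(V))\sum_{y \in T_V(x)} \frac{1}{|T_V(x)|}|\Psi_y\rangle\langle\Psi_y|.$$

Note that if $p_{e,x}(V) \to 0$ as $n \to \infty$, then Bob's $\tilde\rho_x(V)$ approximates Alice's $\rho_x(V)$ in the sense of Lemma 1, and thus Bob's $\tilde\rho_x$ approximates Alice's $\rho_x$ in the sense of Lemma 2.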

To see under which conditions $p_{e,x}(V) \to 0$ as $n \to \infty$, we proceed as follows: Clearly, the probability that a sequence $y$ randomly drawn from $T_{P_y}$ is in $T_V(x)$ equals $|T_V(x)|/|T_{P_y}|$. The probability $p_{e,x}(V)$ that no such sequence is on the list of length $N_l$ is thus equal to $(1 - |T_V(x)|/|T_{P_y}|)^{N_l}$. This quantity can be bounded by applying the inequality $(1 - t)^N \le e^{-Nt}$, and then the ratio $|T_V(x)|/|T_{P_y}|$ can be bounded by applying the inequalities (12):

$$\begin{aligned}
p_{e,x}(V) &= (1 - |T_V(x)|/|T_{P_y}|)^{N_l} \\
&\le \exp\{-N_l|T_V(x)|/|T_{P_y}|\} \\
&\le \exp\{-\exp[n(R - I - \epsilon''_n)]\},
\end{aligned}$$

where $I$ refers to $I(P_x, V)$ and $\epsilon''_n = |\mathcal{X}||\mathcal{Y}|\log(n + 1)/n$. Therefore, if

$$R > I(P_x, V) + \epsilon''_n, \tag{14}$$

we have $p_{e,x}(V) \to 0$ as $n \to \infty$.
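The first step of the bound, $(1 - t)^N \le e^{-Nt}$, is easy to see numerically. In this sketch `q` stands in for the ratio $|T_V(x)|/|T_{P_y}|$; its value and the list sizes are arbitrary illustrative choices.

```python
import math

def miss_probability(hit_fraction, list_size):
    """Exact probability (1 - q)^N that N independent draws all miss a set
    of relative size q."""
    return (1.0 - hit_fraction) ** list_size

def miss_bound(hit_fraction, list_size):
    """Upper bound exp(-N q) obtained from (1 - t)^N <= exp(-N t)."""
    return math.exp(-list_size * hit_fraction)

q = 2.0 ** -10                       # stand-in for |T_V(x)| / |T_{P_y}|
list_sizes = (2 ** 8, 2 ** 10, 2 ** 14)
results = {N: (miss_probability(q, N), miss_bound(q, N)) for N in list_sizes}
```

Once the list size exceeds $1/q$ by a growing factor, the miss probability collapses doubly exponentially, which is the role of condition (14).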
C. Mixed State Fidelity 

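We now have all we need to bound the value of $\sum_{x \in \mathcal{X}^n} P(x)F(\rho_x, \tilde\rho_x)$, and thus prove the main result of the compression algorithm:

Theorem 1: Let $R > I(P_x, V) + \epsilon''_n$ for all $P_x \in \mathcal{P}_n^{P,\delta}(\mathcal{X})$ and all $V \in \mathcal{V}_n^{W,\delta'}(\mathcal{Y}, x)$. Then

$$\sum_{x \in \mathcal{X}^n} P(x)F(\rho_x, \tilde\rho_x) \to 1 \quad \text{as } n \to \infty.$$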
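Proof: By (7) and Lemma 4,

$$\begin{aligned}
\sum_{x \in \mathcal{X}^n} P(x)F(\rho_x, \tilde\rho_x) &\ge 1 - \sum_{x \in \mathcal{X}^n} P(x)D(\rho_x, \tilde\rho_x) \\
&= 1 - \sum_{x \in T^n_{P,\delta}} P(x)D(\rho_x, \tilde\rho_x) - \sum_{x \notin T^n_{P,\delta}} P(x)D(\rho_x, \rho_e) \\
&\ge 1 - \sum_{x \in T^n_{P,\delta}} P(x)D(\rho_x, \tilde\rho_x) - \epsilon_n,
\end{aligned} \tag{15}$$

where $\epsilon_n = |\mathcal{X}|/(2n\delta^2)$. By the joint convexity (10) and Lemma 6,

$$\begin{aligned}
D(\rho_x, \tilde\rho_x) &\le \sum_{V \in \mathcal{V}_n^{W,\delta'}(\mathcal{Y},x)} W^n(T_V(x)|x)D(\rho_x(V), \tilde\rho_x(V)) + \sum_{V \in \mathcal{V}_n(\mathcal{Y},x) \setminus \mathcal{V}_n^{W,\delta'}(\mathcal{Y},x)} W^n(T_V(x)|x)D(\rho_x(V), \rho_{e,x}) \\
&\le \sum_{V \in \mathcal{V}_n^{W,\delta'}(\mathcal{Y},x)} W^n(T_V(x)|x)D(\rho_x(V), \tilde\rho_x(V)) + \epsilon'_n,
\end{aligned} \tag{16}$$

where $\epsilon'_n = |\mathcal{X}||\mathcal{Y}|/(2n\delta'^2)$. By Lemma 1,

$$D(\rho_x(V), \tilde\rho_x(V)) \le p_{e,x}(V). \tag{17}$$

Let $p_{e,n}$ denote the maximum of all $p_{e,x}(V)$ over all $x \in T^n_{P,\delta}$ and $V \in \mathcal{V}_n^{W,\delta'}(\mathcal{Y}, x)$. Combining (15), (16), and (17), we obtain

$$\begin{aligned}
\sum_{x \in \mathcal{X}^n} P(x)F(\rho_x, \tilde\rho_x) &\ge 1 - (1 - \epsilon_n)(1 - \epsilon'_n)p_{e,n} - (1 - \epsilon_n)\epsilon'_n - \epsilon_n \\
&\ge 1 - p_{e,n} - \epsilon'_n - \epsilon_n.
\end{aligned}$$

As $n \to \infty$, we know that $\epsilon_n \to 0$ and $\epsilon'_n \to 0$, whereas $p_{e,n} \to 0$ when the compression rate satisfies (14) for all $x \in T^n_{P,\delta}$ and $V \in \mathcal{V}_n^{W,\delta'}(\mathcal{Y}, x)$. Therefore, under the conditions of the Theorem, we have

$$\sum_{x \in \mathcal{X}^n} P(x)F(\rho_x, \tilde\rho_x) \to 1 \quad \text{as } n \to \infty.$$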



D. Achievable Compression Rate 

To show that a compression rate of $I(P, W)$ is achievable, we use the continuity of entropy:

Lemma 7: If $\{p_i\}_{i=1}^N$ and $\{q_i\}_{i=1}^N$ are two probability distributions such that

$$\sum_{i=1}^{N} |p_i - q_i| \le \theta \le \frac{1}{2},$$

then

$$|H(p_1, \ldots, p_N) - H(q_1, \ldots, q_N)| \le -\theta\log\frac{\theta}{N}.$$

We show that for all $x \in T^n_{P,\delta}$ and $V \in \mathcal{V}_n^{W,\delta'}(\mathcal{Y}, x)$,

$$|I(P, W) - I(P_x, V)| \to 0 \quad \text{as } \delta, \delta' \to 0.$$

Consider

$$\begin{aligned}
|I(P, W) - I(P_x, V)| &\le |H(Q) - H(P_y)| + |H(W|P) - H(V|P_x)| \\
&\le |H(Q) - H(P_y)| + |H(W|P) - H(W|P_x)| + |H(W|P_x) - H(V|P_x)| \\
&\le -|\mathcal{X}||\mathcal{Y}|(\delta + \delta')\log[(\delta + \delta')|\mathcal{X}|] + \delta\log|\mathcal{Y}| - |\mathcal{X}||\mathcal{Y}|\delta'\log\delta'.
\end{aligned} \tag{18}$$

To bound the first and the third term in (18), we used the continuity of entropy (Lemma 7), and to bound the second term, we used the $\log|\mathcal{Y}|$ bound on the entropy of any distribution over $\mathcal{Y}$.

V. Applications 

A. The Example of Fig. 1

We consider the system of Example 1 as shown in Fig. 1. Let $h$ denote the binary entropy function: $h(x) = -x\log(x) - (1 - x)\log(1 - x)$.

For the classical information measures, we have

$$\begin{aligned}
H(Q) &= \log 3 \\
H(W|P) &= h(1/3) = -2/3 + \log 3 \\
I(P, W) &= 2/3
\end{aligned}$$

For the quantum information measures, we have

$$\begin{aligned}
S(\rho) &= 1 \\
\bar S &= h(1/2 - \sqrt{3}/6) \\
\chi &= 1 - h(1/2 - \sqrt{3}/6) = .255\ldots
\end{aligned}$$

Note the gap between $I(P, W)$ and $\chi$.
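The numbers above can be reproduced under an assumed reading of the ensemble. The mixing weights of Fig. 1 are not legible in this copy, so the sketch below assumes three real "trine" states 120 degrees apart, $P = (1/2, 1/2)$, and rows $(2/3, 1/3, 0)$ and $(0, 1/3, 2/3)$ for $W$; these values are a guess chosen because they reproduce the stated $H(Q)$, $I(P,W)$, $\bar S$, and $\chi$, not a transcription of the figure.

```python
import math

def entropy_bits(dist):
    """Shannon entropy in bits, with 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def qubit_entropy(rho):
    """Von Neumann entropy (bits) of a 2x2 real symmetric density matrix."""
    det = rho[0][0] * rho[1][1] - rho[0][1] * rho[1][0]
    gap = math.sqrt(max(1.0 - 4.0 * det, 0.0))  # eigenvalues are (1 +/- gap)/2
    return entropy_bits([(1.0 + gap) / 2.0, (1.0 - gap) / 2.0])

def projector(angle):
    """|psi><psi| for the real unit vector (cos angle, sin angle)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c * c, c * s], [c * s, s * s]]

def mix(weights, mats):
    """Convex combination of 2x2 matrices."""
    return [[sum(w * m[i][j] for w, m in zip(weights, mats)) for j in range(2)]
            for i in range(2)]

# ASSUMED ensemble (see lead-in): trine states and these mixing weights.
psi = [projector(k * 2.0 * math.pi / 3.0) for k in range(3)]
P = [0.5, 0.5]
W = [[2 / 3, 1 / 3, 0.0], [0.0, 1 / 3, 2 / 3]]

rhos = [mix(W[a], psi) for a in range(2)]
rho = mix(P, rhos)

Q = [sum(P[a] * W[a][b] for a in range(2)) for b in range(3)]
I_PW = entropy_bits(Q) - sum(P[a] * entropy_bits(W[a]) for a in range(2))
S_rho = qubit_entropy(rho)
S_bar = sum(P[a] * qubit_entropy(rhos[a]) for a in range(2))
chi = S_rho - S_bar
```

Under these assumptions the achievable rate $I(P,W) = 2/3$ bit exceeds the Holevo lower bound by roughly 0.41 bit per source letter.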

B. Sources of Mixed States with Commuting Density Operators 

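When the density matrices $\rho_a$, $a \in \mathcal{X}$, commute, they can be made diagonal in the same basis. Thus, we shall assume that they are mixtures of orthogonal pure states $|\psi_b\rangle\langle\psi_b|$, $b \in \mathcal{Y}$:

$$\rho_a = \sum_{b \in \mathcal{Y}} W(b|a)\,|\psi_b\rangle\langle\psi_b|.$$

Recall that

$$\rho = \sum_{a \in \mathcal{X}} P(a)\rho_a = \sum_{b \in \mathcal{Y}} Q(b)\,|\psi_b\rangle\langle\psi_b|.$$

Since $|\psi_b\rangle\langle\psi_b|$ are orthogonal, we have $S(\rho) = H(Q)$, and $\sum_{a \in \mathcal{X}} P(a)S(\rho_a) = H(W|P)$. Therefore, the Holevo quantity $\chi$ is in this case equal to the mutual information $I(P, W)$:

$$\chi = S(\rho) - \sum_{a \in \mathcal{X}} P(a)S(\rho_a) = H(Q) - H(W|P) = I(P, W).$$

A way to ensure that Bob's matrices commute is to assign the uniform mixture of pure states $|\Psi_y\rangle\langle\Psi_y|$, $y \in \mathcal{Y}^n$, to each error-state in the compression algorithm:

$$\rho_e = \rho_{e,x} = \rho_{e,x}(V) = \frac{1}{|\mathcal{Y}|^n}\sum_{y \in \mathcal{Y}^n} |\Psi_y\rangle\langle\Psi_y|. \tag{19}$$

Of course, no particular choice of the error-states is required if the only goal is an asymptotically good fidelity. However, commutativity of Bob's matrices keeps the entire system classical, makes it easier to derive an expression for the fidelity, and consequently puts us in a good position to recognize possible related problems of classical information theory.

For each sequence $x$, Alice's density matrix is

$$\rho_x = \sum_{y \in \mathcal{Y}^n} W^n(y|x)\,|\Psi_y\rangle\langle\Psi_y|.$$

With assignment (19), the compression algorithm leaves Bob with the density matrix

$$\tilde\rho_x = \sum_{y \in \mathcal{Y}^n} \tilde W^n(y|x)\,|\Psi_y\rangle\langle\Psi_y|.$$

Therefore, the mixed state fidelity between $\rho_x$ and $\tilde\rho_x$ is

$$\begin{aligned}
F(\rho_x, \tilde\rho_x) &= \left\{\mathrm{Tr}\left[\left(\sqrt{\rho_x}\,\tilde\rho_x\sqrt{\rho_x}\right)^{1/2}\right]\right\}^2 = \left\{\mathrm{Tr}\left[(\rho_x\tilde\rho_x)^{1/2}\right]\right\}^2 \\
&= \Big\{\mathrm{Tr}\Big[\Big(\sum_{y \in \mathcal{Y}^n} W^n(y|x)\tilde W^n(y|x)\,|\Psi_y\rangle\langle\Psi_y|\Big)^{1/2}\Big]\Big\}^2 \\
&= \Big[\sum_{y \in \mathcal{Y}^n} \sqrt{W^n(y|x) \cdot \tilde W^n(y|x)}\Big]^2.
\end{aligned}$$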

VI. Connections with Classical Problems 

We discuss three problems of classical information theory, each to a certain degree related to the problem 
of visible mixed state compression. 

A. Sources of Probability Distributions 

We consider a discrete memoryless source whose alphabet is a set of $|\mathcal{X}|$ coins with $|\mathcal{Y}|$ faces. When coin $C_a$ is tossed, face $b$ appears with probability $W(b|a)$, $a \in \mathcal{X}$, $b \in \mathcal{Y}$. The source, Alice, produces sequences of coins, i.e., probability distributions, where each coin is drawn independently according to the probability distribution $P$. A source whose alphabet consists of two probability distributions is described in the following example:

Example 2: A source of two biased coins is shown in Fig. 2. If coin $C_1$ is tossed, the probability of getting a tail is $w$; if coin $C_2$ is tossed, the probability of getting a head is $w$.

When the $n$ coins in Alice's sequence $C_x = \{C_{x_1}, \ldots, C_{x_n}\}$ are tossed, the probability of getting sequence $y \in \mathcal{Y}^n$ of faces is $W^n(y|x) = W(y_1|x_1) \cdots W(y_n|x_n)$. Each time Alice is given a sequence of coins $C_x$, the reproducing source Bob prepares a sequence of faces $y \in \mathcal{Y}^n$ with probability $\tilde W^n(y|x)$ such that

$$\sum_{x \in \mathcal{X}^n} P(x)F_{\mathcal{Y}^n}\big(W^n(\cdot|x), \tilde W^n(\cdot|x)\big) \to 1 \quad \text{as } n \to \infty. \tag{20}$$

Here $F_{\mathcal{Y}^n}(\cdot, \cdot)$ is the Bhattacharyya-Wootters overlap between two probability distributions over the set $\mathcal{Y}^n$:

$$F_{\mathcal{Y}^n}\big(W^n(\cdot|x), \tilde W^n(\cdot|x)\big) = \Big[\sum_{y \in \mathcal{Y}^n} \sqrt{W^n(y|x) \cdot \tilde W^n(y|x)}\Big]^2.$$

Requirement (20) ensures that Alice and Bob appear to be identical sources of probability distributions to an observer who can see only the sequences of faces at both ends. More precisely, with probability approaching 1 as $n$ increases, such an observer cannot tell the difference between Alice and Bob. We immediately see that goal (20) can be achieved by running the compression algorithm described in Sec. IV.
B. Type Covering 

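We again consider the source of the previous section, whose alphabet is a set of $|\mathcal{X}|$ coins with $|\mathcal{Y}|$ faces. But now, for each of Alice's sequences $C_x$ of coins, Bob prepares a predetermined sequence $y(x)$ of faces such that

$$\sum_{x \in \mathcal{X}^n} P(x)F_{\mathcal{X} \times \mathcal{Y}}\big(P_xW(\cdot|\cdot), P_{x,y(x)}\big) \to 1 \quad \text{as } n \to \infty.$$

Here $F_{\mathcal{X} \times \mathcal{Y}}(\cdot, \cdot)$ is the Bhattacharyya-Wootters overlap between two probability distributions over the set $\mathcal{X} \times \mathcal{Y}$:

$$\begin{aligned}
F_{\mathcal{X} \times \mathcal{Y}}\big(P_xW(\cdot|\cdot), P_{x,y(x)}\big) &= \Big[\sum_{(a,b) \in \mathcal{X} \times \mathcal{Y}} \sqrt{P_x(a)W(b|a) \cdot P_{x,y(x)}(a, b)}\Big]^2 \\
&= \Big[\frac{1}{n}\sum_{(a,b) \in \mathcal{X} \times \mathcal{Y}} \sqrt{N(a|x)W(b|a) \cdot N(a, b|x, y(x))}\Big]^2.
\end{aligned} \tag{21}$$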
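The overlap (21) depends on $x$ and $y(x)$ only through counts, so it is simple to evaluate. In the sketch below the function name `type_overlap`, the dictionary encoding of $W$, and the example sequences are illustrative assumptions.

```python
import math
from collections import Counter

def type_overlap(x, y, W):
    """Overlap (21) between P_x W and the joint type P_{x,y}.

    W maps each letter a to a dict {b: W(b|a)}; missing faces have probability 0.
    """
    n = len(x)
    n_a = Counter(x)
    n_ab = Counter(zip(x, y))
    # Pairs (a, b) that never occur contribute zero, so summing over the
    # observed pairs is enough.
    total = sum(math.sqrt(n_a[a] * W[a].get(b, 0.0) * n_ab[(a, b)])
                for (a, b) in n_ab)
    return (total / n) ** 2

W = {"a": {0: 0.5, 1: 0.5}, "b": {0: 1.0}}
x = "aabb"

good = type_overlap(x, (0, 1, 0, 0), W)   # joint type equals P_x W exactly
bad = type_overlap(x, (0, 0, 0, 0), W)    # face 1 never follows letter a
```

The overlap reaches 1 only when the joint type matches $P_xW$, and drops as soon as the face counts drift away from $N(a|x)W(b|a)$.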

This problem was translated into a rate distortion one, and solved for perfect and imperfect asymptotic 
fidelity in [||. By using only simple combinatorial techniques, we will show that I(P, W) is the optimal rate 
for perfect asymptotic fidelity. 
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We first show that the overlap (|2l|) is close to 1 if and only if y (x) is W-generated by x with some constant 
5 close to 0. We prove this claim in the following two lemmas by using the inequahties (|^), which bound the 
fidelity in terms of the trace distance and vice versa. 

Lemma 8: Let y(x) be W-generated by x with constant 6: 



-N(a,b|x,y) - -N(a|x)W(b|a) 
n n 



< 6 for every a G ^, b G 3^. (22) 



Then 

F;txy(PxW(-|-),Px,y(x)) ^1, as 6 ^ 0. 



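Proof: By (??), we can bound the Bhattacharyya-Wootters overlap (21) in terms of the corresponding
trace distance:

\[
\begin{aligned}
F_{\mathcal{X}\times\mathcal{Y}}\bigl(P_x W(\cdot|\cdot),\, P_{x,y(x)}\bigr)
&= \Bigl[\, \frac{1}{n} \sum_{(a,b) \in \mathcal{X}\times\mathcal{Y}} \sqrt{N(a|x) W(b|a) \cdot N(a,b|x,y(x))} \,\Bigr]^2 \\
&\ge \Bigl[\, 1 - \frac{1}{2} \sum_{(a,b) \in \mathcal{X}\times\mathcal{Y}} \Bigl| \frac{1}{n} N(a,b|x,y) - \frac{1}{n} N(a|x) W(b|a) \Bigr| \,\Bigr]^2 \\
&\ge 1 - \sum_{(a,b) \in \mathcal{X}\times\mathcal{Y}} \Bigl| \frac{1}{n} N(a,b|x,y) - \frac{1}{n} N(a|x) W(b|a) \Bigr| .
\end{aligned}
\]

Because of (22), we have

\[
F_{\mathcal{X}\times\mathcal{Y}}\bigl(P_x W(\cdot|\cdot),\, P_{x,y(x)}\bigr) > 1 - \delta \cdot |\mathcal{X}| \cdot |\mathcal{Y}| .
\]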



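Lemma 9: Let the Bhattacharyya-Wootters overlap (21) be equal to $1 - \alpha/2$. Then the sequence $y(x)$ is
$W$-generated by $x$ with constant $2\sqrt{\alpha}$.

Proof: By (??), we can bound the trace distance between the distributions $P_x W(\cdot|\cdot)$ and $P_{x,y(x)}$ in terms
of their Bhattacharyya-Wootters overlap (21):

\[
\frac{1}{2} \sum_{(a,b) \in \mathcal{X}\times\mathcal{Y}} \Bigl| \frac{1}{n} N(a,b|x,y) - \frac{1}{n} N(a|x) W(b|a) \Bigr|
\le \bigl[ 1 - F_{\mathcal{X}\times\mathcal{Y}}\bigl(P_x W(\cdot|\cdot),\, P_{x,y(x)}\bigr) \bigr]^{1/2} \le \sqrt{\alpha} .
\]

It follows that

\[
\Bigl| \frac{1}{n} N(a,b|x,y) - \frac{1}{n} N(a|x) W(b|a) \Bigr| \le 2\sqrt{\alpha} \quad \text{for every } a \in \mathcal{X},\ b \in \mathcal{Y}.
\]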



Therefore, for a given $x$, the fidelity (21) is close to 1 if and only if $y(x)$ is $W$-generated by $x$ with some
constant $\delta$ close to 0. A compression code $\mathcal{C} \subset \mathcal{Y}^n$ will have to contain at least one
such $y(x)$ for each $x \in T^n_{P,\delta}$, as shown next.
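
The two directions used in Lemmas 8 and 9 can be checked numerically for any pair of distributions. The sketch below assumes the standard relations between the squared overlap $F$ and the sum of absolute differences, namely $F \ge 1 - \sum|p - q|$ and $\frac{1}{2}\sum|p - q| \le \sqrt{1 - F}$. The function names and the test distributions are illustrative only.

```python
from math import sqrt


def overlap(p, q):
    """Bhattacharyya-Wootters overlap [sum_i sqrt(p_i q_i)]^2."""
    return sum(sqrt(a * b) for a, b in zip(p, q)) ** 2


def l1_distance(p, q):
    """Sum of absolute differences sum_i |p_i - q_i|."""
    return sum(abs(a - b) for a, b in zip(p, q))


def check_bounds(p, q, tol=1e-12):
    """Return True if both bounds hold for the pair (p, q)."""
    f, d = overlap(p, q), l1_distance(p, q)
    small_deviation_gives_large_overlap = f >= 1 - d - tol
    large_overlap_gives_small_deviation = d / 2 <= sqrt(max(0.0, 1 - f)) + tol
    return small_deviation_gives_large_overlap and large_overlap_gives_small_deviation


# Hypothetical pairs of distributions over a three-letter set.
pairs = [
    ([0.5, 0.5, 0.0], [0.5, 0.0, 0.5]),
    ([0.2, 0.3, 0.5], [0.25, 0.25, 0.5]),
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),
]
results = [check_bounds(p, q) for p, q in pairs]
```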

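Definition 5: We shall say that code $\mathcal{C} \subset \mathcal{Y}^n$ of face sequences covers set
$\mathcal{B} \subset \mathcal{X}^n$ of coin sequences with constant $\delta$ if it contains at least one element of
$T^n_{W,\delta}(x)$ for each $x \in \mathcal{B}$, i.e., for each $x \in \mathcal{B}$, we have

\[
\mathcal{C} \cap T^n_{W,\delta}(x) \ne \emptyset .
\]

Theorem 2: Let $\mathcal{C}$ be a code which covers the set $T^n_{P,\delta}$. For each $x \in \mathcal{X}^n$, let $y(x)$
be an element of $\mathcal{C} \cap T^n_{W,\delta}(x)$ if $x \in T^n_{P,\delta}$, and let $y(x)$ be an arbitrary
$y_0 \in \mathcal{Y}^n$ otherwise. Then

\[
\sum_{x \in \mathcal{X}^n} P(x)\, F_{\mathcal{X}\times\mathcal{Y}}\bigl(P_x W(\cdot|\cdot),\, P_{x,y(x)}\bigr) \to 1 \quad \text{as } n \to \infty .
\]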



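Proof: By Lemmas ?? and 8,

\[
\sum_{x \in \mathcal{X}^n} P(x)\, F_{\mathcal{X}\times\mathcal{Y}}\bigl(P_x W(\cdot|\cdot),\, P_{x,y(x)}\bigr)
\ge \Bigl( 1 - \sum_{x \notin T^n_{P,\delta}} P(x) \Bigr) \bigl( 1 - \delta \cdot |\mathcal{X}| \cdot |\mathcal{Y}| \bigr) .
\]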
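
Condition (22), Definition 5, and the choice of $y(x)$ in Theorem 2 can be sketched as follows. This is only an illustration: the helper names, the explicit `faces` argument, and the toy source are assumptions, and `select_y` simply falls back to a caller-supplied default sequence when no codeword is $W$-generated by $x$.

```python
from collections import Counter


def is_w_generated(x, y, W, delta, faces):
    """Condition (22): |N(a,b|x,y)/n - N(a|x) W(b|a)/n| < delta for all a, b."""
    n = len(x)
    n_a = Counter(x)
    n_ab = Counter(zip(x, y))
    return all(
        abs(n_ab[(a, b)] / n - n_a[a] * W[a][b] / n) < delta
        for a in W
        for b in faces
    )


def covers(code, B, W, delta, faces):
    """Definition 5: every x in B has at least one W-generated y in the code."""
    return all(any(is_w_generated(x, y, W, delta, faces) for y in code) for x in B)


def select_y(x, code, W, delta, faces, default):
    """Pick a codeword W-generated by x if one exists, else the default sequence."""
    for y in code:
        if is_w_generated(x, y, W, delta, faces):
            return y
    return default


# Hypothetical source: coin "C1" is fair, coin "C2" always shows "h".
W = {"C1": {"h": 0.5, "t": 0.5}, "C2": {"h": 1.0, "t": 0.0}}
faces = ("h", "t")
x = ("C1", "C1", "C2", "C2")
good = ("h", "t", "h", "h")
bad = ("h", "h", "h", "h")
chosen = select_y(x, [bad, good], W, 0.1, faces, default=bad)
```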



We assume that Alice and Bob both know the compression code $\mathcal{C}$. To identify $y(x)$, Alice has to send to
Bob $\log |\mathcal{C}|$ bits of classical information. The compression rate is, therefore, given by

\[
R = \frac{\log |\mathcal{C}|}{n}
\]

and is determined by the size of the smallest code $\mathcal{C}$ that covers $T^n_{P,\delta}$. To bound the size of
$\mathcal{C}$, we shall use the following simple general result about coverings, known as the Johnson-Stein-Lovász
Theorem (see for example [?, p. 322]):

Theorem 3: Let $A$ be a 0-1 matrix with $N$ rows and $M$ columns. Assume that each row contains at
least $v$ ones and each column at most $a$ ones. Then there exists an $N \times K$ submatrix $C$ of $A$ with

\[
K \le \frac{N}{a} + \frac{M}{v} \log a \le \frac{M}{v} (1 + \log a)
\]

such that $C$ contains no all-zero rows.
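
To make Theorem 3 concrete, the sketch below selects columns greedily, which is a standard constructive route to covering bounds of this type and not necessarily the argument of the cited reference. It then compares the number of chosen columns with $(M/v)(1 + \log a)$, taking the logarithm to be natural. The circulant matrix and the function names are hypothetical.

```python
from math import log


def greedy_cover(A):
    """Greedily pick columns of the 0-1 matrix A until every row has a one.

    Returns the list of chosen column indices.
    """
    n_rows, n_cols = len(A), len(A[0])
    uncovered = set(range(n_rows))
    chosen = []
    while uncovered:
        # Pick the column that covers the most still-uncovered rows.
        best = max(range(n_cols), key=lambda j: sum(A[i][j] for i in uncovered))
        newly = {i for i in uncovered if A[i][best]}
        if not newly:
            raise ValueError("some row of A is all-zero")
        chosen.append(best)
        uncovered -= newly
    return chosen


def jsl_bound(A):
    """Right-hand bound (M/v)(1 + log a) of Theorem 3, natural logarithm."""
    M = len(A[0])
    v = min(sum(row) for row in A)                        # fewest ones in a row
    a = max(sum(row[j] for row in A) for j in range(M))   # most ones in a column
    return M / v * (1 + log(a))


# Hypothetical 12 x 12 circulant matrix: row i has ones in columns i, i+1, i+2 (mod 12).
A = [[1 if (j - i) % 12 < 3 else 0 for j in range(12)] for i in range(12)]
cols = greedy_cover(A)
```

The chosen columns play the role of the code $\mathcal{C}$ and the rows the role of the sequences to be covered.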

In order to use Theorem 3 in bounding compression code rate $R$, we construct matrix $A$ as follows: The
rows of $A$ are indexed by sequences $x$ that are $P$-typical with constant $\delta_x$, columns by sequences $y$ that
are $Q$-typical with constant $\delta_y$. Thus $A$ has $|T^n_{P,\delta_x}|$ rows and $|T^n_{Q,\delta_y}|$ columns. An
element of $A$ in row $x$ and column $y$ is set to 1 if $x$ and $y$ are jointly typical with constant $\delta_{xy}$,
i.e., if

\[
\Bigl| P(a) W(b|a) - \frac{1}{n} N(a,b|x,y) \Bigr| = \Bigl| Q(b) U(a|b) - \frac{1}{n} N(a,b|x,y) \Bigr| < \delta_{xy},
\]

otherwise to 0. We first show that all $y$'s corresponding to the 1's in a particular row $x$ are $W$-generated by
$x$ with constant $\delta = \delta_x + \delta_{xy}$: For each row $x$ having a 1 in column $y$, we have

\[
\begin{aligned}
\Bigl| P_x(a) W(b|a) - \frac{1}{n} N(a,b|x,y) \Bigr|
&\le \bigl| P_x(a) W(b|a) - P(a) W(b|a) \bigr| + \Bigl| P(a) W(b|a) - \frac{1}{n} N(a,b|x,y) \Bigr| \\
&< \delta_x + \delta_{xy} = \delta .
\end{aligned}
\]

Therefore $y \in T^n_{W,\delta}(x)$. Since $C$ is a submatrix of $A$ with no all-zero rows, the set $\mathcal{C}$ of
sequences $y$ indexing the columns of $C$ covers the set $T^n_{P,\delta_x}$. Therefore, by Theorem 2, $\mathcal{C}$ can
serve as a compression code asymptotically achieving perfect fidelity.

We find $v$, a lower bound to the number of 1's in each row, as follows: For each $x$, consider all sequences
$y$ which are $W$-generated by $x$ with constant $\delta'$. If $\delta'$ is set to be equal to $\delta_{xy} - \delta_x$,
we have

\[
\begin{aligned}
\Bigl| P(a) W(b|a) - \frac{1}{n} N(a,b|x,y) \Bigr|
&\le \bigl| P(a) W(b|a) - P_x(a) W(b|a) \bigr| + \Bigl| P_x(a) W(b|a) - \frac{1}{n} N(a,b|x,y) \Bigr| \\
&< \delta_x + \delta' = \delta_{xy} .
\end{aligned}
\]

Thus, if $y \in T^n_{W,\delta'}(x)$, the element of $A$ in row $x$ and column $y$ is set to 1. Therefore, the number
of 1's in each row is at least $v$:

\[
v = \exp[n(H(W/P) - \epsilon')] .
\]

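We find $a$, an upper bound to the number of 1's in each column, as follows: For each column $y$ having a
1 in row $x$, we have

\[
\begin{aligned}
\Bigl| P_y(b) U(a|b) - \frac{1}{n} N(a,b|x,y) \Bigr|
&\le \bigl| P_y(b) U(a|b) - Q(b) U(a|b) \bigr| + \Bigl| Q(b) U(a|b) - \frac{1}{n} N(a,b|x,y) \Bigr| \\
&< \delta_y + \delta_{xy} .
\end{aligned}
\]

Therefore $x \in T^n_{U,\delta''}(y)$, $\delta'' = \delta_y + \delta_{xy}$, and thus the number of 1's in each column
is at most $a$:

\[
a = \exp[n(H(U/Q) + \epsilon'')] .
\]
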
Theorem 3 gives an upper bound on $K$, the number of columns in $C$, and thus on the code rate

\[
R = \frac{\log K}{n} .
\]

Since $M = |T^n_{Q,\delta_y}| \le \exp[n(H(Q) + \epsilon_y)]$, we have

\[
K \le \frac{M}{v} (1 + \log a) \le \exp[n(H(Q) - H(W/P) + \epsilon_y + \epsilon')] \cdot [1 + n(H(U/Q) + \epsilon'')] .
\]

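Now, for any $N \times K$ submatrix of $A$ with no all-zero rows, we have $K \cdot a \ge N \cdot 1$, and thus

\[
K \ge \frac{N}{a} \ge \exp[n(H(P) - H(U/Q) + \epsilon_x - \epsilon'')] .
\]

Therefore, the compression rate $R$ is bounded by

\[
I(P, W) + \epsilon_x - \epsilon'' \le R \le I(P, W) + \epsilon_y + \epsilon' + \frac{\log[1 + n(H(U/Q) + \epsilon'')]}{n},
\]

where $\epsilon_x, \epsilon_y, \epsilon', \epsilon'' \to 0$ as $n \to \infty$, and the compression rate $I(P, W)$ is
asymptotically optimal.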
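
Both bounds meet at $I(P, W)$ because $H(Q) - H(W/P) = H(P) - H(U/Q)$, where $U(a|b) = P(a) W(b|a) / Q(b)$ is the reverse channel. The following minimal sketch checks this identity numerically. The source distribution, the coin matrix, and the function names are hypothetical, and entropies are in natural units.

```python
from math import log


def entropy(p):
    """Shannon entropy (natural logarithm) of a probability vector."""
    return -sum(q * log(q) for q in p if q > 0)


def entropies(P, W):
    """Return H(P), H(Q), H(W/P), H(U/Q) for source P and coin matrix W[a][b]."""
    n_a, n_b = len(P), len(W[0])
    Q = [sum(P[a] * W[a][b] for a in range(n_a)) for b in range(n_b)]
    h_w_given_p = sum(P[a] * entropy(W[a]) for a in range(n_a))
    # Reverse channel U(a|b) = P(a) W(b|a) / Q(b).
    h_u_given_q = sum(
        Q[b] * entropy([P[a] * W[a][b] / Q[b] for a in range(n_a)])
        for b in range(n_b)
        if Q[b] > 0
    )
    return entropy(P), entropy(Q), h_w_given_p, h_u_given_q


# Hypothetical source: two equiprobable coins with two faces each.
P = [0.5, 0.5]
W = [[0.9, 0.1], [0.2, 0.8]]
h_p, h_q, h_wp, h_uq = entropies(P, W)
upper_exponent = h_q - h_wp   # exponent of M / v in the upper bound on K
lower_exponent = h_p - h_uq   # exponent of N / a in the lower bound on K
```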

Let us now compare the compression problem in this section with the earlier one in Sec. VI-A. In the earlier
case, for each of Alice's sequences of coins $x$, Bob most likely chooses one of the approximately $\exp\{nH(W/P)\}$
sequences of faces $W$-generated by $x$, each one with roughly the same probability. In the case we just
considered, for each of Alice's $x$, Bob's sequence of faces will always be a fixed sequence $y(x)$, $W$-generated by
$x$. Note that in both cases, after a $y \in T^n_{W,\delta}(x)$ has been identified for Bob, his uncertainty about $x$
reduces from $H(P)$ to $H(U/Q)$; hence the same compression rate.

To an observer who can see only the sequences of faces at both ends, Alice and Bob now do not appear to
be identical sources for any rate of compression $R$ smaller than the entropy $H(Q)$: Bob has about $\exp(nR)$
different, equally likely face-sequences of length $n$ whereas Alice has about $\exp(nH(Q))$. For each of Alice's $x$,
for quantum transmission or storage, Bob can prepare the quantum state $|\Psi_y\rangle\langle\Psi_y|$ instead of the
sequence of faces $y$. In the scenario of Sec. VI-A, his state is roughly a uniform mixture of pure states
$|\Psi_y\rangle\langle\Psi_y|$ where each $y$ is $W$-generated by $x$, whereas in the scenario of this section, his state
is the pure state $|\Psi_{y(x)}\rangle\langle\Psi_{y(x)}|$.

C. Channel Coding and Lossy Mixed State Compression 

The Bhattacharyya distance is in classical information theory most commonly known for its role in bounding
the error probability of a discrete memoryless channel (DMC): Consider a DMC with input alphabet $\mathcal{X}$,
output alphabet $\mathcal{Y}$, and transition probabilities $W(b|a)$, $a \in \mathcal{X}$, $b \in \mathcal{Y}$. When
sequence $x \in \mathcal{X}^n$ has been transmitted, the probability that the maximum likelihood detector finds
sequence $x' \in \mathcal{X}^n$ more likely is smaller than

\[
\sum_{y \in \mathcal{Y}^n} \sqrt{W^n(y|x)\, W^n(y|x')} .
\]

This bound is known as the Bhattacharyya bound and its negative logarithm as the Bhattacharyya distance
between sequences $x$ and $x'$. The probability of error for the maximum likelihood decoder can then be
bounded in terms of the rate of the channel code used for transmission. One way to derive such a bound is
by solving a special rate distortion problem. We state the problem below and describe its connection with
a particular lossy mixed state compression problem. For its application to channel coding, we refer the reader to
[?] or textbooks [10, pp. 185, 193] and [?, pp. 408-410].

Consider a lossy mixed state compression problem where both the original source Alice and the reproduction
source Bob have the same alphabet $\mathcal{X}$. The fidelity between sequences $x$ and $x'$ is the
Bhattacharyya-Wootters overlap between $W^n(\cdot|x)$ and $W^n(\cdot|x')$:

\[
F(x, x') = \Bigl[\, \sum_{y \in \mathcal{Y}^n} \sqrt{W^n(y|x)\, W^n(y|x')} \,\Bigr]^2 . \tag{23}
\]

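Let $\mathcal{C} \subset \mathcal{X}^n$ be a reproduction code. We encode source sequence $x \in \mathcal{X}^n$ by
choosing the codeword $\hat{x}$ which maximizes the fidelity $F(x, \hat{x})$. Let $F(x|\mathcal{C})$ denote this
maximum fidelity:

\[
F(x|\mathcal{C}) = \max_{\hat{x} \in \mathcal{C}} F(x, \hat{x}),
\]

and $F(\mathcal{C})$, the expected fidelity achieved with code $\mathcal{C}$:

\[
F(\mathcal{C}) = \sum_{x \in \mathcal{X}^n} P(x)\, F(x|\mathcal{C}) .
\]

We are interested in finding out how the fidelity $F(\mathcal{C})$ depends on the rate of code $\mathcal{C}$.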

We get the answer to the question through the following rate distortion problem. Let again the source and
reproduction alphabets be $\mathcal{X}$. Define a single-letter distortion measure between a source letter $a$ and a
reproduction letter $a'$ to be the Bhattacharyya distance between the letters:

\[
d_W(a, a') = -\log \sum_{b \in \mathcal{Y}} \sqrt{W(b|a)\, W(b|a')}, \quad a, a' \in \mathcal{X} .
\]

To make the distortion finite, we shall assume that any two coins have at least one common face, i.e., for
all $a, a' \in \mathcal{X}$, there is a $b \in \mathcal{Y}$ such that $W(b|a) > 0$ and $W(b|a') > 0$. Thus, we have

\[
0 \le d_W(a, a') \le d_0, \quad a, a' \in \mathcal{X} .
\]

Because of our assumption that $W$ has no identical rows, $d_W(a, a') = 0$ iff $a = a'$.

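The distortion between sequences is the average of the per-letter distortion between elements of the
sequences:

\[
\begin{aligned}
d_W(x, x') &= \frac{1}{n} \sum_{i=1}^{n} d_W(x_i, x'_i) = -\frac{1}{n} \log \prod_{i=1}^{n} \sum_{b \in \mathcal{Y}} \sqrt{W(b|x_i)\, W(b|x'_i)} \\
&= -\frac{1}{n} \log \sum_{y \in \mathcal{Y}^n} \sqrt{W^n(y|x)\, W^n(y|x')}, \quad x, x' \in \mathcal{X}^n .
\end{aligned}
\]

The fidelity (23) is therefore given by

\[
F(x, x') = \exp(-2n\, d_W(x, x')) .
\]

Note that if the distortion between two sequences remains strictly positive as $n$ increases, the fidelity between
them approaches 0.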
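
The relation between the fidelity (23) and the per-letter distortion can be verified by brute force for short sequences. The sketch below sums over all face sequences of length $n$ and compares the result with $\exp(-2n\, d_W(x, x'))$. The coin matrix, the sequences, and the function names are hypothetical.

```python
from itertools import product
from math import exp, log, sqrt


def letter_distortion(W, a, a2):
    """Bhattacharyya distance d_W(a, a') = -log sum_b sqrt(W(b|a) W(b|a'))."""
    return -log(sum(sqrt(p * q) for p, q in zip(W[a], W[a2])))


def sequence_distortion(W, x, x2):
    """Average per-letter distortion d_W(x, x')."""
    return sum(letter_distortion(W, a, a2) for a, a2 in zip(x, x2)) / len(x)


def fidelity_brute_force(W, x, x2):
    """Fidelity (23): [sum over all y of sqrt(W^n(y|x) W^n(y|x'))]^2."""
    n_faces = len(W[0])
    total = 0.0
    for y in product(range(n_faces), repeat=len(x)):
        p = q = 1.0
        for a, a2, b in zip(x, x2, y):
            p *= W[a][b]    # W^n(y|x) for a memoryless channel
            q *= W[a2][b]   # W^n(y|x')
        total += sqrt(p * q)
    return total ** 2


# Hypothetical coins (rows) and faces (columns).
W = [[0.9, 0.1], [0.2, 0.8]]
x, x2 = (0, 1, 1, 0), (0, 0, 1, 1)
f_direct = fidelity_brute_force(W, x, x2)
f_formula = exp(-2 * len(x) * sequence_distortion(W, x, x2))
```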

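Let $\mathcal{C} \subset \mathcal{X}^n$ be a reproduction code. We encode source sequence $x \in \mathcal{X}^n$ by
choosing the codeword $\hat{x}$ that minimizes the distortion $d(x, \hat{x})$. Let $d(x|\mathcal{C})$ denote this
minimum distortion:

\[
d(x|\mathcal{C}) = \min_{\hat{x} \in \mathcal{C}} d(x, \hat{x}),
\]

and $d(\mathcal{C})$, the expected distortion achieved with code $\mathcal{C}$:

\[
d(\mathcal{C}) = \sum_{x \in \mathcal{X}^n} P(x)\, d(x|\mathcal{C}) .
\]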
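
For very short block lengths, the expected distortion and the corresponding expected fidelity of a reproduction code can be computed by enumerating all source sequences. The sketch below does this for a memoryless source, using the fact that the codeword minimizing the distortion also maximizes the fidelity $\exp(-2n\, d)$. The source, the coin matrix, the codes, and the function names are hypothetical.

```python
from itertools import product
from math import exp, log, prod, sqrt


def d_letter(W, a, a2):
    """Bhattacharyya distance between letters a and a'."""
    return -log(sum(sqrt(p * q) for p, q in zip(W[a], W[a2])))


def d_seq(W, x, x2):
    """Average per-letter distortion between sequences x and x'."""
    return sum(d_letter(W, a, a2) for a, a2 in zip(x, x2)) / len(x)


def expected_distortion_and_fidelity(P, W, code, n):
    """Return (d(C), F(C)) by enumerating every source sequence of length n."""
    d_total = f_total = 0.0
    for x in product(range(len(P)), repeat=n):
        prob = prod(P[a] for a in x)                 # memoryless source
        d_min = min(d_seq(W, x, c) for c in code)    # minimum distortion d(x|C)
        d_total += prob * d_min
        f_total += prob * exp(-2 * n * d_min)        # maximum fidelity F(x|C)
    return d_total, f_total


# Hypothetical source and two codes of block length 3.
P = [0.5, 0.5]
W = [[0.9, 0.1], [0.2, 0.8]]
n = 3
full_code = list(product(range(2), repeat=n))   # every sequence is a codeword
tiny_code = [(0, 0, 0)]                         # a single codeword
d_full, f_full = expected_distortion_and_fidelity(P, W, full_code, n)
d_tiny, f_tiny = expected_distortion_and_fidelity(P, W, tiny_code, n)
```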

Let $V$ be an $|\mathcal{X}| \times |\mathcal{X}|$ stochastic matrix with elements $V_{aa'} = V(a'|a)$,
$a, a' \in \mathcal{X}$, and let

\[
d(V) = \sum_{a, a' \in \mathcal{X}} P(a)\, V(a'|a)\, d(a, a')
\]

be the average distortion associated with $V$. The rate distortion function of a DMS with generic distribution
$P$ is given by

\[
R(D) = \min_{V:\, d(V) \le D} I(P, V) .
\]

Its significance, found by Shannon in [15], is expressed by the source coding theorem and its converse (see
[?, pp. 397-400] for the form used here). Before stating the theorem and its application to our problem, we
compute the distortion measure $d_W(\cdot, \cdot)$ and the rate distortion function for the source of Example ??.

Example 3: Consider again the source shown in Fig. 2. We have

\[
d_W(C_1, C_2) = -\log \sqrt{4p(1-p)} \quad \text{and} \quad d_W(x, x') = \frac{1}{n} D_H(x, x') \cdot d_W(C_1, C_2),
\]

where $D_H(x, x')$ is the Hamming distance between sequences $x$ and $x'$. The rate distortion function is given
by

\[
R(D) = H(P) - H(D / d_W(C_1, C_2)) .
\]

Note that $R(0) = H(P)$, which is true in general under our assumptions.
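
The quantities of Example 3 can be evaluated with a short sketch. It assumes that the two coins show their two faces with probabilities $(p, 1-p)$ and $(1-p, p)$, which is consistent with the stated value of $d_W(C_1, C_2)$, and it reads $H(\cdot)$ of a number as the binary entropy function. Both readings, as well as the function names, are assumptions made for illustration.

```python
from math import log, sqrt


def binary_entropy(t):
    """Binary entropy (natural logarithm) of a number t in [0, 1]."""
    return 0.0 if t in (0.0, 1.0) else -t * log(t) - (1 - t) * log(1 - t)


def d_letter(w1, w2):
    """General definition: -log sum_b sqrt(w1(b) w2(b))."""
    return -log(sum(sqrt(p * q) for p, q in zip(w1, w2)))


def d12(p):
    """Closed form d_W(C1, C2) = -log sqrt(4 p (1 - p)) for the assumed coins."""
    return -log(sqrt(4 * p * (1 - p)))


def d_seq(x, x2, p):
    """d_W(x, x') = (Hamming distance / n) * d_W(C1, C2)."""
    hamming = sum(a != b for a, b in zip(x, x2))
    return hamming / len(x) * d12(p)


def rate_distortion(D, p, h_source):
    """R(D) = H(P) - H(D / d_W(C1, C2)), as stated in Example 3."""
    return h_source - binary_entropy(D / d12(p))


p = 0.1                                   # hypothetical face probability
closed_form = d12(p)
from_definition = d_letter([p, 1 - p], [1 - p, p])
half_differ = d_seq((0, 1, 1, 0), (0, 0, 1, 1), p)
r_zero = rate_distortion(0.0, p, log(2))  # equiprobable coins: H(P) = log 2
```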

Theorem 4: [?, pp. 397-400] Source Coding Theorem and its Converse
For any block length $n$ and rate $R$, there exists a block code $\mathcal{C} \subset \mathcal{X}^n$ with average
distortion $d(\mathcal{C})$ satisfying

\[
d(\mathcal{C}) \le D + d_0 e^{-nE(R,D)},
\]

where $E(R, D) > 0$ for $R > R(D)$. Conversely, no source code for which $d(\mathcal{C}) \le D$ has rate smaller than
$R(D)$.



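We use this result to show how the fidelity $F(\mathcal{C})$ depends on the rate of code $\mathcal{C}$:

Theorem 5: For any block length $n$ and rate $R > H(P)$, a rate $R$ block code $\mathcal{C} \subset \mathcal{X}^n$
exists such that the fidelity $F(\mathcal{C}) \to 1$ as $n \to \infty$. Conversely, for any code
$\mathcal{C} \subset \mathcal{X}^n$ with rate $R < H(P)$, $F(\mathcal{C}) \to 0$ as $n \to \infty$.
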
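Proof: By the Source Coding Theorem 4, we have

\[
\begin{aligned}
F(\mathcal{C}) &= \sum_{x \in \mathcal{X}^n} P(x) \exp(-2n\, d_W(x|\mathcal{C})) \\
&\ge \sum_{x \in \mathcal{X}^n} P(x) \bigl( 1 - 2n\, d_W(x|\mathcal{C}) \bigr) \\
&\ge 1 - 2n\, d_0 e^{-nE(R,0)},
\end{aligned}
\]

where $E(R, 0) > 0$ for $R > R(0) = H(P)$. Therefore, the fidelity can be made arbitrarily close to 1 by increasing
the block length $n$.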

By the Converse to the Source Coding Theorem 4, no code for which $d(\mathcal{C}) \le 0$ has rate smaller than $H(P)$.
Thus for $R < H(P)$, we have $d(\mathcal{C}) = D > 0$. Therefore, the distortion $d(x|\mathcal{C})$ remains strictly
positive as $n$ increases for a probabilistically large set of sequences $x$. Consequently, for the same set the
fidelity $F(x|\mathcal{C})$ approaches 0. ■